for course Introduction to Open Data Science 2021
:white_check_mark:
Feeling a bit excited for this course.
I heard about this course through University of Eastern Finland (internal communication) Yammer pages.
Link to Github repository here.
date()
## [1] "Mon Dec 13 03:39:22 2021"
This week I have done quite a few DataCamp exercises. So many, in fact, that I went on to the next week's DataCamp exercises, because DataCamp's grey-blue progress bar at the bottom showed five distinct chapters and I did not realize that one bar was for one week. This stressed me quite a bit, as there did not seem to be an end to the DataCamp exercises.
Unfortunately I have not managed to start the recommended reading material this week, which I know I really should have.
I started the exercises too late.
Describe the work you have done this week and summarize your learning.
date()
## [1] "Mon Dec 13 03:39:22 2021"
Read the data file to an object. Display the object (data table).
# Read the data
learning2014 <- read.table(
file = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt",
header = TRUE,
sep = ",")
# check the first few rows of the data table
head(learning2014)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
Looking at the table, a note on the earlier Data Wrangling part: attitude should have been scaled (divided by 10) as in the DataCamp exercises. The exercise instructions were unclear on this in the Data Wrangling part.
Check data dimensionality:
dim(learning2014)
## [1] 166 7
Check data structure:
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The dataset is from a questionnaire study, the “international survey of Approaches to Learning”. Check here.
“gender” is the sex/gender of the questionnaire participant. Data type is character (chr).
“age” is the age of the participant in years (presumably). Data type is integer (int).
“attitude” is “Global attitude toward statistics” divided by 10. Data type is numeric (num).
“deep” is a mean value of the following original variables: “D03”, “D11”, “D19”, “D27”, “D07”, “D14”, “D22”, “D30”,“D06”, “D15”, “D23”, “D31”, summarizing deep learning of the participant. Data type is numeric (can be numbers with decimals).
“stra” is a mean value of the following original variables: “ST01”, “ST09”, “ST17”, “ST25”, “ST04”, “ST12”, “ST20”, “ST28”, summarizing strategic learning of the participant. Data type is numeric (can be numbers with decimals).
“surf” is a mean value of the following original variables: “SU02”, “SU10”, “SU18”, “SU26”, “SU05”, “SU13”, “SU21”, “SU29”, “SU08”, “SU16”, “SU24”, “SU32”, summarizing surface learning of the participant. Data type is numeric (can be numbers with decimals).
“points” is the participant’s exam score. Data type is integer (int).
First we might need to install some packages to make some nice graphical overviews:
# uncomment if you need to install
#install.packages(c("GGally", "ggplot2"))
Let’s start plotting some scatter plot matrix from our data:
# access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
As you can see, the (multi)plot above is quite busy, with lots of information to take in.
Let’s start off by noting the column names on top and the row names on the right side. There is no legend for gender, but the plot shows F and M: F for Female and M for Male. So the plot is color coded by gender.
In the top left corner (gender column, gender row) we can see the counts of each value: F in a pinkish color, M in cyan. The female participant count is almost double the male count. Let’s check this:
table(learning2014$gender)
##
## F M
## 110 56
In the top row (to the right of gender) we have boxplots for each gender. A boxplot describes the spread/distribution of the values: the box spans the interquartile range from Q1 (25th percentile) to Q3 (75th percentile), with the median (Q2, the 50th percentile) drawn as a line inside the box. Circles mark possible outliers, conventionally points more than 1.5 times the interquartile range beyond the box.
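As a small illustration on a toy vector (not the course data), the quartiles that define the box can be computed directly with quantile():

```r
# quartiles of a toy vector; a boxplot's box spans Q1 to Q3,
# with the median drawn inside
x <- c(17, 21, 22, 25, 27, 30, 55)
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
iqr <- unname(q[3] - q[1])               # interquartile range (the box width)
upper_fence <- unname(q[3] + 1.5 * iqr)  # upper outlier fence
```

Here 55 lies above the upper fence (39), so it would be drawn as an outlier circle; no normality assumption is involved in that rule.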
On the left side (under gender), the bar plots show, for each gender, the counts (x-axis) of the values of the row variable (y-axis).
The scatter plots (plots with points) show the column variable (x-axis) plotted against the row variable (y-axis).
We also have pairwise correlations, one variable against another. The most interesting of the correlations: points and attitude are positively correlated with each other, and deep learning and surface learning are negatively correlated with each other, but only for males. So the values of these variables are somehow intertwined or related to each other.
Print summaries:
summary(learning2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
For each numeric variable we have the minimum, maximum and mean. The median, 1st quartile and 3rd quartile are the values presented in the boxplots (as previously explained). By definition of the quartiles, 50% of the values lie between Q1 and Q3, whatever the distribution.
A multiple regression model is used, as there is more than one explanatory variable.
Let’s have deep, stra and surf as explanatory variables for target variable points.
# create multiple regression model
my_model <- lm(points ~ deep + stra + surf, data = learning2014)
# print out summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ deep + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1235 -3.0737 0.5226 4.2799 10.3229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.9143 5.1169 5.260 4.5e-07 ***
## deep -0.7443 0.8662 -0.859 0.3915
## stra 0.9878 0.5962 1.657 0.0994 .
## surf -1.6296 0.9153 -1.780 0.0769 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared: 0.04071, Adjusted R-squared: 0.02295
## F-statistic: 2.292 on 3 and 162 DF, p-value: 0.08016
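As a sanity check on the output above, the adjusted R-squared can be recomputed from the multiple R-squared with the standard formula (numbers copied from the summary):

```r
# adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
r2 <- 0.04071   # multiple R-squared from the summary
n  <- 166       # number of observations
p  <- 3         # explanatory variables: deep, stra, surf
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 5)   # ~0.02295, matching the reported adjusted R-squared
```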
This does not seem too good. In multiple regression we are more interested in the adjusted R-squared value, which is quite low here, so the model explains very little of the variation in exam points. Let’s change the model by switching one of the explanatory variables.
New model:
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
A one-point increase in attitude is associated with a 3.395-point increase in exam points, adjusting for the other variables. 0.853 is the estimated effect of strategic learning on exam points, adjusting for attitude and surface learning, and -0.586 is the estimated effect of surface learning on exam points, adjusting for attitude and strategic learning.
Adjusted R-squared is 0.1927, so almost 20% of the variation in exam points can be explained by attitude, strategic learning and surface learning. Adjusted R-squared is the more relevant measure here, as we have multiple explanatory variables.
Let’s produce the plots: Residual vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# create the model again just in case
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# draw diagnostic plots using the plot() function. Choose the plots Residual vs Fitted values, Normal QQ-plot and Residuals vs Leverage (which are 1,2 and 5)
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))
Residuals vs Fitted: a reasonably random spread of points, so there should not be a problem with the constant-variance assumption.
Normal Q-Q: in the middle the points fit the normality line quite well, but at both tails they deviate from the line, so the fit to the normality assumption is not entirely reasonable.
Residuals vs Leverage: this plot helps to identify observations with unusually high impact. There seem to be a few outlying values at the bottom of the plot, and these outliers may strongly influence the model.
alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv", header = TRUE, sep = ",")
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
The original data set description can be found here. The data read above is a combination of two data sets on performance in Mathematics and in Portuguese language: originally two different data sets with the same variables on student grades and demographic, social and school-related features.
Variables/column names with suffix .p are originals from the por (Portuguese language) dataset and those with suffix .m from the mat (Mathematics) dataset. The combined dataset has the following derived variables: “failures”, “paid” (first value selected), “absences”, “G1”, “G2” and “G3” are averaged from the mat and por datasets. “alc_use” is the average of “Dalc”, workday alcohol consumption (numeric: from 1 - very low to 5 - very high), and “Walc”, weekend alcohol consumption (same scale). “high_use” is TRUE if “alc_use” is higher than 2 and FALSE otherwise.
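A minimal sketch of how the derived alcohol variables are built, on a toy data frame (column names taken from the description above; the real combination logic lives in the data wrangling script):

```r
# toy rows standing in for the real student data
toy <- data.frame(Dalc = c(1, 3, 2), Walc = c(2, 5, 2))

# average of workday and weekend consumption
toy$alc_use <- (toy$Dalc + toy$Walc) / 2
# TRUE when the average exceeds 2
toy$high_use <- toy$alc_use > 2

toy
```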
higher - wants to take higher education. I hypothesize that students not wanting to take higher education would have higher alcohol consumption.
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent). I think family relationship might be an interesting variable that might be highly correlated with alcohol consumption as in students with bad family relationships would likely drink more and vice versa.
age - my hypothesis is that older persons will likely have higher alcohol consumption, as the age range is from 15 to 22, over 18 persons can buy alcohol legally.
failures - number of past class failures (numeric: n if 1<=n<3, else 4). I hypothesize that a high number of failures would correlate with high alcohol consumption.
Check our variable types etc.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
my_vars <- c("high_use", "higher", "age", "famrel", "failures")
interest_alc <- subset(alc, select = my_vars)
glimpse(interest_alc)
## Rows: 370
## Columns: 5
## $ high_use <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, F~
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes"~
## $ age <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1~
## $ famrel <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3, 4~
## $ failures <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
Let’s examine the count distributions of our selected variables “higher”, “age”, “famrel”, “failures” and “high_use”.
library(tidyr); library(ggplot2)
gather(interest_alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
Interesting, it looks like most students do want to take higher education. Let’s check the actual counts and plot them.
alc %>% group_by(higher) %>% summarise(count = n())
## # A tibble: 2 x 2
## higher count
## <chr> <int>
## 1 no 16
## 2 yes 354
alc %>% group_by(high_use, higher) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups: high_use [2]
## high_use higher count
## <lgl> <chr> <int>
## 1 FALSE no 7
## 2 FALSE yes 252
## 3 TRUE no 9
## 4 TRUE yes 102
alc %>% group_by(high_use, higher) %>% summarise(count = n()) %>%
ggplot(aes(x = higher, y = count, fill = high_use)) +
geom_col() +
facet_wrap("high_use")
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
I am somewhat surprised that the majority of the students want to continue to higher education; overall, very few do not. It looks like my hypothesis does not hold.
Summarize counts of high_use by famrel in table and in barplot.
alc %>% group_by(high_use, famrel) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 10 x 3
## # Groups: high_use [2]
## high_use famrel count
## <lgl> <int> <int>
## 1 FALSE 1 6
## 2 FALSE 2 9
## 3 FALSE 3 39
## 4 FALSE 4 128
## 5 FALSE 5 77
## 6 TRUE 1 2
## 7 TRUE 2 9
## 8 TRUE 3 25
## 9 TRUE 4 52
## 10 TRUE 5 23
alc %>% group_by(high_use, famrel) %>% summarise(count = n()) %>%
ggplot(aes(x = famrel, y = count, fill = high_use)) +
geom_col() +
facet_wrap("high_use")
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
famrel is the quality of family relationships (numeric: from 1 - very bad to 5 - excellent), so bad relations get a low number and good relations a high one. My hypothesis looks rather poor here: the counts for high_use combined with bad family relationships are very low, and overall bad family relationships seem to be rare among the students.
Summarize counts of high_use by failures in table and in barplot.
alc %>% group_by(high_use, failures) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 8 x 3
## # Groups: high_use [2]
## high_use failures count
## <lgl> <int> <int>
## 1 FALSE 0 238
## 2 FALSE 1 12
## 3 FALSE 2 8
## 4 FALSE 3 1
## 5 TRUE 0 87
## 6 TRUE 1 12
## 7 TRUE 2 9
## 8 TRUE 3 3
alc %>% group_by(high_use, failures) %>% summarise(count = n()) %>%
ggplot(aes(x = failures, y = count, fill = high_use)) +
geom_col() +
facet_wrap("high_use")
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
I hypothesized that many failures would be related to high alcohol consumption. There does not seem to be much difference between the groups: the overall failure counts are very low, and the distribution of one or more failures is very similar in the high and non-high alcohol consumption groups.
alc %>% group_by(high_use) %>% summarise(mean = mean(age))
## # A tibble: 2 x 2
## high_use mean
## <lgl> <dbl>
## 1 FALSE 16.5
## 2 TRUE 16.8
# initialise a plot of high_use and age
g2 <- ggplot(alc, aes(x = high_use, y = age))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student age by alcohol consumption")
# let's also see the same with sex added
g3 <- ggplot(alc, aes(x = high_use, y = age, col = sex))
g3 + geom_boxplot() + ggtitle("Student age by alcohol consumption and sex")
alc %>% group_by(high_use, age) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups: high_use [2]
## high_use age count
## <lgl> <int> <int>
## 1 FALSE 15 63
## 2 FALSE 16 74
## 3 FALSE 17 62
## 4 FALSE 18 52
## 5 FALSE 19 7
## 6 FALSE 20 1
## 7 TRUE 15 18
## 8 TRUE 16 28
## 9 TRUE 17 35
## 10 TRUE 18 25
## 11 TRUE 19 4
## 12 TRUE 22 1
alc %>% group_by(high_use, age) %>% summarise(count = n()) %>%
ggplot(aes(x = age, y = count, fill = high_use)) +
geom_col() +
facet_wrap("high_use")
## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
The average age in the high-use group is slightly higher, but there does not seem to be a clearly higher share of high alcohol consumption among older students, in particular those aged 18 or over.
Overall it looks like I chose variables that have a very weak or nonexistent relation to high alcohol consumption.
We’ll perform logistic regression to explore the relationships of the famrel, age, higher and failures variables with high/low alcohol consumption as the target variable.
log_reg_model <- glm(high_use ~ famrel + age + higher + failures, data = alc, family = "binomial")
summary(log_reg_model)
##
## Call:
## glm(formula = high_use ~ famrel + age + higher + failures, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6954 -0.8036 -0.7110 1.3402 1.8570
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7777 1.9341 -0.919 0.3580
## famrel -0.2815 0.1254 -2.245 0.0248 *
## age 0.1405 0.1029 1.365 0.1723
## higheryes -0.4499 0.5899 -0.763 0.4457
## failures 0.5594 0.2102 2.661 0.0078 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 432.08 on 365 degrees of freedom
## AIC: 442.08
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(log_reg_model) %>% exp
# compute confidence intervals (CI)
CI <- confint(log_reg_model) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1690283 0.003738272 7.4864947
## famrel 0.7546270 0.589068018 0.9650944
## age 1.1508478 0.941172724 1.4104114
## higheryes 0.6376962 0.198545179 2.0808162
## failures 1.7495838 1.162299822 2.6710151
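The odds ratios above are simply the exponentiated model coefficients; for example, the famrel and failures rows can be reproduced from the coefficient estimates in the summary:

```r
# odds ratio = exp(log-odds coefficient)
exp(-0.2815)   # famrel:   ~0.755, matching the OR table
exp(0.5594)    # failures: ~1.750, matching the OR table
```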
The model above treats famrel as a continuous variable, so we only see a single slope and cannot compare its levels. Let’s model it as a factor.
log_reg_model <- glm(high_use ~ factor(famrel) + age + factor(higher) + failures, data = alc, family = "binomial")
summary(log_reg_model)
##
## Call:
## glm(formula = high_use ~ factor(famrel) + age + factor(higher) +
## failures, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7501 -0.8559 -0.7101 1.2656 1.8934
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1286 2.0803 -1.504 0.13260
## factor(famrel)2 0.8850 0.9551 0.927 0.35411
## factor(famrel)3 0.5687 0.8622 0.660 0.50949
## factor(famrel)4 0.1298 0.8381 0.155 0.87693
## factor(famrel)5 -0.2315 0.8574 -0.270 0.78714
## age 0.1445 0.1036 1.395 0.16310
## factor(higher)yes -0.4172 0.5945 -0.702 0.48281
## failures 0.5515 0.2122 2.599 0.00934 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 429.86 on 362 degrees of freedom
## AIC: 445.86
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(log_reg_model) %>% exp
# compute confidence intervals (CI)
CI <- confint(log_reg_model) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.04377755 0.0006880176 2.482282
## factor(famrel)2 2.42308808 0.4105189123 20.100563
## factor(famrel)3 1.76605639 0.3673243645 12.821709
## factor(famrel)4 1.13859072 0.2499432346 7.999589
## factor(famrel)5 0.79333156 0.1665015354 5.717950
## age 1.15541978 0.9437919758 1.417882
## factor(higher)yes 0.65888088 0.2035056764 2.170862
## failures 1.73582033 1.1481286518 2.658353
Let’s try a model without the intercept.
log_reg_model <- glm(high_use ~ factor(famrel) + age + factor(higher) + failures -1, data = alc, family = "binomial")
summary(log_reg_model)
##
## Call:
## glm(formula = high_use ~ factor(famrel) + age + factor(higher) +
## failures - 1, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7501 -0.8559 -0.7101 1.2656 1.8934
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## factor(famrel)1 -3.1286 2.0803 -1.504 0.13260
## factor(famrel)2 -2.2436 1.9831 -1.131 0.25792
## factor(famrel)3 -2.5599 1.9126 -1.338 0.18075
## factor(famrel)4 -2.9988 1.9279 -1.556 0.11982
## factor(famrel)5 -3.3601 1.9482 -1.725 0.08458 .
## age 0.1445 0.1036 1.395 0.16310
## factor(higher)yes -0.4172 0.5945 -0.702 0.48281
## failures 0.5515 0.2122 2.599 0.00934 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 512.93 on 370 degrees of freedom
## Residual deviance: 429.86 on 362 degrees of freedom
## AIC: 445.86
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(log_reg_model) %>% exp
# compute confidence intervals (CI)
CI <- confint(log_reg_model) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## factor(famrel)1 0.04377755 0.0006880176 2.482282
## factor(famrel)2 0.10607686 0.0021202344 5.145847
## factor(famrel)3 0.07731362 0.0017718458 3.260660
## factor(famrel)4 0.04984471 0.0011021640 2.152387
## factor(famrel)5 0.03473011 0.0007347747 1.554473
## age 1.15541978 0.9437919758 1.417882
## factor(higher)yes 0.65888088 0.2035056764 2.170862
## failures 1.73582033 1.1481286518 2.658353
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# Structure and dimensions of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Original data description can be found here.
The Boston data frame contains housing values in suburbs of Boston; it has 506 rows and 14 columns.
crim - per capita crime rate by town.
zn - proportion of residential land zoned for lots over 25,000 sq.ft.
indus - proportion of non-retail business acres per town.
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox - nitrogen oxides concentration (parts per 10 million).
rm - average number of rooms per dwelling.
age - proportion of owner-occupied units built prior to 1940.
dis - weighted mean of distances to five Boston employment centres.
rad - index of accessibility to radial highways.
tax - full-value property-tax rate per $10,000.
ptratio - pupil-teacher ratio by town.
black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat - lower status of the population (percent).
medv - median value of owner-occupied homes in $1000s.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Plot matrix of variables
library(ggplot2)
library(GGally)
# remove chas variable which is categorical
Boston_cont <- subset(Boston, select = -chas)
ggpairs(Boston_cont,
upper = list(continuous = wrap('cor', size = 4)),
title = "Scatterplot matrix of `Boston`")
Very few variables seem to be truly normally distributed; rm (average number of rooms per dwelling) seems to be the closest.
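As a quick numeric complement to this visual impression (not part of the original exercise), a Shapiro-Wilk test can be run on rm; a small p-value suggests a departure from normality:

```r
# access the Boston data and test rm for normality
library(MASS)
data("Boston")
sw <- shapiro.test(Boston$rm)
sw
# note: with n = 506, even small departures from normality
# tend to give small p-values, so the plot matters too
```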
Let’s take a more graphical look at the correlations of the variables:
library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
In the correlation matrix visualization above, red symbolizes negative correlation and blue positive correlation, and the larger and darker the circle, the stronger the correlation.
Few picks on the correlation matrix:
crim “per capita crime rate by town” is most positively correlated with rad “index of accessibility to radial highways”.
indus, “proportion of non-retail business acres per town”, is negatively correlated with dis, “weighted mean of distances to five Boston employment centres”. So a bigger proportion of non-retail business acres tends to occur in towns far from Boston’s five employment centres; in other words, industrial areas tend to be situated away from the town centres.
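These picks can also be made programmatically; a small self-contained sketch that recomputes the correlation matrix and finds crim's strongest positive correlate:

```r
library(MASS)
data("Boston")
cm <- cor(Boston)
# correlations of crim with every other variable
crim_cor <- cm["crim", setdiff(colnames(cm), "crim")]
names(which.max(crim_cor))   # "rad", as noted above
round(max(crim_cor), 2)      # 0.63
```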
Standardize and scale the variables in the dataset:
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
After standardizing and scaling, the variables have negative values too, whereas before there were only positive values: centering makes each variable’s mean zero. The values of all variables are now in “the same ballpark”.
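scale() centers each column by its mean and divides by its standard deviation; a quick check by hand on one variable confirms this:

```r
library(MASS)
data("Boston")
# manual standardization: (x - mean) / sd
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual, as.numeric(scale(Boston$crim)))   # TRUE
```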
Here we create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate) using quantiles (and renaming them “low”, “med_low”, “med_high”, “high”) and we drop the original crime rate variable out:
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Here we divide the dataset into a training set (80% of the rows) and a test set (the remaining 20%):
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set from the remaining rows
test <- boston_scaled[-ind,]
Here we fit the linear discriminant analysis (LDA) on the training set. The previously modified categorical crime rate (“low”, “med_low”, “med_high”, “high”) is used as target variable and everything else is used as predictors.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2425743 0.2648515 0.2351485
##
## Group means:
## zn indus chas nox rm age
## low 0.89760886 -0.9055446 -0.083045403 -0.8633595 0.47213621 -0.8643913
## med_low -0.07618846 -0.3013237 0.008892378 -0.5785672 -0.07624802 -0.3586401
## med_high -0.38946429 0.1645030 0.206010213 0.3610501 0.15513588 0.4071430
## high -0.48724019 1.0172655 0.017773055 1.0660034 -0.39118806 0.7945294
## dis rad tax ptratio black lstat
## low 0.8410431 -0.6991716 -0.7441105 -0.4453634 0.38009349 -0.76546551
## med_low 0.3791867 -0.5342034 -0.4709942 -0.1147330 0.35547911 -0.15819719
## med_high -0.3588030 -0.3861710 -0.2899262 -0.2941608 0.07462915 -0.08705448
## high -0.8455703 1.6366336 1.5129868 0.7790365 -0.80855104 0.85791313
## medv
## low 0.54311425
## med_low 0.03970303
## med_high 0.22435925
## high -0.66681928
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11271945 0.72184857 -0.97572163
## indus -0.01949366 -0.26199204 0.37915333
## chas -0.06805971 -0.03873991 0.08191538
## nox 0.42445425 -0.66603176 -1.32782291
## rm -0.08497623 -0.08819147 -0.10910245
## age 0.24362107 -0.33755599 -0.18211929
## dis -0.09585201 -0.21736194 0.22795726
## rad 2.88419830 0.95649157 -0.01929760
## tax 0.07416565 -0.06475131 0.52113724
## ptratio 0.12713013 0.06280466 -0.36644490
## black -0.16733577 0.03157892 0.22136894
## lstat 0.24766507 -0.16115794 0.52075821
## medv 0.22377498 -0.38527226 -0.14788436
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9441 0.0412 0.0147
Let’s draw the LDA (bi)plot:
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
It seems that high crime rate is best predicted by rad, the index of accessibility to radial highways. zn, the proportion of residential land zoned for lots over 25,000 sq. ft., seems to best predict low crime rates, so a scarcity of small lots is associated with a lower crime rate…
nox, nitrogen oxides concentration (parts per 10 million), also seems to separate the med_high and high crime-rate classes.
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 17 6 0 0
## med_low 4 17 7 0
## med_high 1 4 14 0
## high 0 0 0 32
High crime rates especially are predicted correctly using the model fitted on the training data. Low and med_low are also predicted correctly in most cases; the weakest prediction is for the med_low class. Note that the random sampling (choosing 80% for training) may affect how well med_low and med_high are predicted, as they are not as "easy" as the low and high classes. The more cohesive a cluster the target class forms (as seen, for example, in the previous plot), the easier it is to predict.
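To put a single number on this, the overall accuracy can be read off the cross-tabulation as the share of the diagonal. A small sketch using the counts printed above:

```r
# Overall accuracy from the confusion matrix printed above
conf <- matrix(c(17,  6,  0,  0,
                  4, 17,  7,  0,
                  1,  4, 14,  0,
                  0,  0,  0, 32),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)   # ~0.78, i.e. about 78% of test cases classified correctly
```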
Here we standardize the data so that distances between observations are comparable, and then calculate the distances between the observations.
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
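For comparison, dist() also supports other metrics; a sketch computing Manhattan (city-block) distances on the same standardized data, since the choice of metric can matter for clustering:

```r
# Manhattan distances on the same standardized data
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))

dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)
```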
It is somewhat unclear from the instructions whether the k-means algorithm should be explored with the SCALED data… but here we go:
km <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
With two clusters you can get pretty decent separation between most variable pairs.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
Read the data into object human2, using the first column as row names.
human2 <- read.table(file = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", header = TRUE, sep = ",", row.names = 1)
# check dimensions (155, 8)
dim(human2)
## [1] 155 8
Let’s check the structure of the data first.
str(human2)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
All variables are numeric.
Let’s check summary.
summary(human2)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Edu2.FM is the "Proportion of females with at least secondary education" divided by the "Proportion of males with at least secondary education". Values under 1 mean the proportion of females with at least secondary education is lower than that of males; values over 1 mean it is higher. It looks like females typically have a lower proportion of at least secondary education than males (per country).
Labo.FM is the "Proportion of females in the labour force" divided by the "Proportion of males in the labour force". As with the previous variable, values below 1 mean the proportion of females working is lower than that of males, which seems to be the case overall.
Edu.Exp is the expected years of schooling per country. The minimum value (5.40 years) is very low; the mean is 13.18 years.
Life.Exp is life expectancy at birth.
GNI is Gross National Income per capita. There are really big differences here; let's see more in the graphical overview.
Mat.Mor, the maternal mortality ratio, is deaths per 100,000 live births, ranging from 1 to 1100. The mean, on the other hand, is 149.1, so the values are heavily skewed towards the low end.
Ado.Birth, the adolescent birth rate, is births per 1,000 women aged 15–19.
Parli.F is the percentage of female representatives in parliament. The minimum is 0, i.e. an all-male parliament; the maximum reaches only 57.5%. The mean is ~21%.
library(ggplot2)
library(GGally)
ggpairs(human2,
upper = list(continuous = wrap('cor', size = 4)),
title = "Scatterplot matrix of selected variables from Human Development Report 2015")
Figure 1. Scatterplot/correlations of human2 variables
Life.Exp is positively correlated with Edu.Exp. Higher life expectancy correlates with higher expected years in school.
GNI is positively correlated with Edu.Exp. Higher Gross National Income per capita correlates with higher expected years in school. GNI also positively correlates with Life.Exp. Higher gross national income per capita correlates with higher life expectancy.
Mat.mor is negatively correlated with Edu2.FM. Higher Maternal mortality ratio correlates with lower proportion of females with at least secondary education to males. Also higher adolescent birth rate Ado.Birth correlates with lower Edu2.FM.
Higher adolescent birth rate Ado.birth correlates with lower amount of expected years in school (Edu.Exp), lower life expectancy (Life.Exp), lower gross National Income per capita (GNI) and higher maternal mortality rate (Mat.Mor).
Principal component analysis of the non-standardized data.
# perform pca
pca_human <- prcomp(human2)
# summary of pca_human
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human,
cex = c(0.8, 1),
col = c("grey40", "deeppink2"),
xlab = pc_lab[1],
ylab = pc_lab[2])
Figure 2. Biplot of PCA on non-standardized data - The first principal component explains 100% of the variance in the non-standardized data; the second, third etc. principal components explain close to nothing (rounding to zero). This is typical when the variables are not measured on the same scale.
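The scale effect can be demonstrated with a tiny synthetic sketch (hypothetical data, not the human2 variables): when one variable's variance dwarfs the others, unscaled PCA loads almost everything onto the first component.

```r
# Synthetic demo: one high-variance variable dominates unscaled PCA
set.seed(1)
d <- data.frame(small = rnorm(100, sd = 1),     # variance ~1
                huge  = rnorm(100, sd = 1000))  # variance ~1e6, like GNI here
round(100 * summary(prcomp(d))$importance[2, ], 1)   # PC1 ~100%, PC2 ~0%
```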
Principal component analysis of standardized data.
# standardize variables
human_std <- scale(human2)
# perform pca
pca_human <- prcomp(human_std)
# summary of pca_human
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human,
cex = c(0.8, 1),
col = c("grey40", "deeppink2"),
xlab = pc_lab[1],
ylab = pc_lab[2])
Figure 3. Biplot of PCA on standardized data - Unlike in the previous figure 2 (section 3), the first principal component no longer explains 100% of the variance but 53.6%; the second principal component explains 16.2%, PC3 9.6% and so on.
Our variables are not measured on the same scale. By doing standardization we assign equal importance to all variables.
On the first principal component axis Edu.Exp, GNI, Edu2.FM and Life.Exp point in (almost) the same direction. Countries clustering in this direction along the first principal component (x-axis) have high educational expectancy, high GNI, high life expectancy and a high female-to-male ratio in at least secondary education. The arrow lengths describe the size of each variable's contribution.
Mat.Mor and Ado.Birth point in the opposite direction to the previously described group of variables. So high maternal mortality and high adolescent birth rate correlate with each other, and the effect of these variables lies mostly along the first principal component axis (x-axis).
On the second principal component axis (y-axis), high Parli.F (percentage of female parliament representatives) and Labo.FM (female-to-male ratio of labour-force participation) point in almost the same direction, though slightly skewed along the x-axis.
There is quite a lot we can read from this picture. For example, the Nordic countries Iceland, Sweden, Norway, Finland and Denmark cluster in the upper left of the plot.
These countries have one or more of the following high:
- Educational expectancy
- Life expectancy
- Gross National Income per capita
- Female-to-male ratio in at least secondary education
- Percentage of female parliament representatives
- Female-to-male ratio of labour-force participation
Qatar has one or more of the following high:
- Educational expectancy
- Life expectancy
- Gross National Income per capita
- Female-to-male ratio in at least secondary education
…but quite low:
- Percentage of female parliament representatives
- Female-to-male ratio of labour-force participation
Islamic countries such as Iran, Syria, Lebanon, Jordan and Yemen have one or more of the following among the lowest:
- Percentage of female parliament representatives
- Female-to-male ratio of labour-force participation
African countries such as Mozambique, Niger, Mali, Chad, Central African Republic, Congo and Sierra Leone have one or more of the following among the highest:
- Maternal mortality
- Adolescent birth rate
Not all variables contribute equally for each country; for some countries the effects of the variables can differ a lot.
Let's install FactoMineR if we do not already have it, load the FactoMineR library, load the tea data and check its structure.
#install.packages("FactoMineR")
library(FactoMineR)
# load data
data(tea)
# check structure
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
?tea
## starting httpd help server ... done
The data used here concern a questionnaire on tea. We asked to 300 individuals how they drink tea (18 questions), what are their product’s perception (12 questions) and some personal details (4 questions).
Okay, so apart from age, which is continuous (integer), every variable is categorical (a factor), most with two levels (two possible values). There are 36 variables and 300 rows/observations.
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
library(dplyr)
library(ggplot2)
library(tidyr)
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Visualization of a few (categorical) variables of the tea dataset, with the level of the variable on the x-axis and the count on the y-axis.
Most people drink tea made from tea bags bought from a chain store, most likely Earl Grey, without any additions (lemon, milk, other) and mostly not at lunch. Close to half drink their tea with sugar.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
#plot(mca, invisible=c("ind"), habillage = "quali")
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
Figure 4. MCA plot for tea dataset
The MCA plot reveals some patterns between the categorical variables.
For example, in the lower right corner, unpackaged and tea shop cluster together: people who drink unpackaged tea likely buy it from a tea shop.
In the mid-lower left side of the plot, tea bag and chain store cluster together: people who drink their tea from tea bags most likely buy them from chain stores.
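These map-based readings can be double-checked with a plain cross-tabulation of the two variables (a sketch, assuming FactoMineR and its tea data are available as above):

```r
# Cross-tabulate tea packaging vs. place of purchase
library(FactoMineR)
data(tea)
table(tea$how, tea$where)
# 'tea bag' should pair mostly with 'chain store',
# 'unpackaged' with 'tea shop'
```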
Let's run the MCA again using all categorical variables (dropping the continuous variable age).
# multiple correspondence analysis
teaa <- select(tea, -age)
mca <- MCA(teaa, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = teaa, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.090 0.082 0.070 0.063 0.056 0.053 0.050
## % of var. 5.838 5.292 4.551 4.057 3.616 3.465 3.272
## Cumulative % of var. 5.838 11.130 15.681 19.738 23.354 26.819 30.091
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.048 0.047 0.044 0.041 0.040 0.039 0.037
## % of var. 3.090 3.053 2.834 2.643 2.623 2.531 2.388
## Cumulative % of var. 33.181 36.234 39.068 41.711 44.334 46.865 49.252
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.036 0.035 0.034 0.032 0.031 0.031 0.030
## % of var. 2.302 2.275 2.172 2.085 2.013 2.011 1.915
## Cumulative % of var. 51.554 53.829 56.000 58.086 60.099 62.110 64.025
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27 Dim.28
## Variance 0.028 0.027 0.026 0.025 0.025 0.024 0.024
## % of var. 1.847 1.740 1.686 1.638 1.609 1.571 1.524
## Cumulative % of var. 65.872 67.611 69.297 70.935 72.544 74.115 75.639
## Dim.29 Dim.30 Dim.31 Dim.32 Dim.33 Dim.34 Dim.35
## Variance 0.023 0.022 0.021 0.020 0.020 0.019 0.019
## % of var. 1.459 1.425 1.378 1.322 1.281 1.241 1.222
## Cumulative % of var. 77.099 78.523 79.901 81.223 82.504 83.745 84.967
## Dim.36 Dim.37 Dim.38 Dim.39 Dim.40 Dim.41 Dim.42
## Variance 0.018 0.017 0.017 0.016 0.015 0.015 0.014
## % of var. 1.152 1.092 1.072 1.019 0.993 0.950 0.924
## Cumulative % of var. 86.119 87.211 88.283 89.301 90.294 91.244 92.169
## Dim.43 Dim.44 Dim.45 Dim.46 Dim.47 Dim.48 Dim.49
## Variance 0.014 0.013 0.012 0.011 0.011 0.010 0.010
## % of var. 0.891 0.833 0.792 0.729 0.716 0.666 0.660
## Cumulative % of var. 93.060 93.893 94.684 95.414 96.130 96.796 97.456
## Dim.50 Dim.51 Dim.52 Dim.53 Dim.54
## Variance 0.009 0.009 0.008 0.007 0.006
## % of var. 0.605 0.584 0.519 0.447 0.390
## Cumulative % of var. 98.060 98.644 99.163 99.610 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.580 1.246 0.174 | 0.155 0.098 0.012 | 0.052 0.013
## 2 | -0.376 0.522 0.108 | 0.293 0.350 0.066 | -0.164 0.127
## 3 | 0.083 0.026 0.004 | -0.155 0.099 0.015 | 0.122 0.071
## 4 | -0.569 1.196 0.236 | -0.273 0.304 0.054 | -0.019 0.002
## 5 | -0.145 0.078 0.020 | -0.142 0.083 0.019 | 0.002 0.000
## 6 | -0.676 1.693 0.272 | -0.284 0.330 0.048 | -0.021 0.002
## 7 | -0.191 0.135 0.027 | 0.020 0.002 0.000 | 0.141 0.095
## 8 | -0.043 0.007 0.001 | 0.108 0.047 0.009 | -0.089 0.038
## 9 | -0.027 0.003 0.000 | 0.267 0.291 0.049 | 0.341 0.553
## 10 | 0.205 0.155 0.028 | 0.366 0.546 0.089 | 0.281 0.374
## cos2
## 1 0.001 |
## 2 0.021 |
## 3 0.009 |
## 4 0.000 |
## 5 0.000 |
## 6 0.000 |
## 7 0.015 |
## 8 0.006 |
## 9 0.080 |
## 10 0.052 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## breakfast | 0.182 0.504 0.031 3.022 | 0.020 0.007 0.000 0.330 |
## Not.breakfast | -0.168 0.465 0.031 -3.022 | -0.018 0.006 0.000 -0.330 |
## Not.tea time | -0.556 4.286 0.240 -8.468 | 0.004 0.000 0.000 0.065 |
## tea time | 0.431 3.322 0.240 8.468 | -0.003 0.000 0.000 -0.065 |
## evening | 0.276 0.830 0.040 3.452 | -0.409 2.006 0.087 -5.109 |
## Not.evening | -0.144 0.434 0.040 -3.452 | 0.214 1.049 0.087 5.109 |
## lunch | 0.601 1.678 0.062 4.306 | -0.408 0.854 0.029 -2.924 |
## Not.lunch | -0.103 0.288 0.062 -4.306 | 0.070 0.147 0.029 2.924 |
## dinner | -1.105 2.709 0.092 -5.240 | -0.081 0.016 0.000 -0.386 |
## Not.dinner | 0.083 0.204 0.092 5.240 | 0.006 0.001 0.000 0.386 |
## Dim.3 ctr cos2 v.test
## breakfast -0.107 0.225 0.011 -1.784 |
## Not.breakfast 0.099 0.208 0.011 1.784 |
## Not.tea time 0.062 0.069 0.003 0.950 |
## tea time -0.048 0.054 0.003 -0.950 |
## evening 0.344 1.653 0.062 4.301 |
## Not.evening -0.180 0.864 0.062 -4.301 |
## lunch 0.240 0.343 0.010 1.719 |
## Not.lunch -0.041 0.059 0.010 -1.719 |
## dinner 0.796 1.805 0.048 3.777 |
## Not.dinner -0.060 0.136 0.048 -3.777 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.031 0.000 0.011 |
## tea.time | 0.240 0.000 0.003 |
## evening | 0.040 0.087 0.062 |
## lunch | 0.062 0.029 0.010 |
## dinner | 0.092 0.000 0.048 |
## always | 0.056 0.035 0.007 |
## home | 0.016 0.002 0.030 |
## work | 0.075 0.020 0.022 |
## tearoom | 0.321 0.019 0.031 |
## friends | 0.186 0.061 0.030 |
# visualize MCA
#plot(mca, invisible=c("ind"), habillage = "quali")
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
Figure 5. MCA plot for tea dataset with all categorical variables
#library(factoextra)
#fviz_mca_biplot(mca,
# repel = TRUE,
# ggtheme = theme_minimal())
Using all variables except the continuous variable age, we get this kind of plot. Note that with more categorical variables, each correspondence dimension explains far less of the variance: only 5.84% for dim 1 and 5.29% for dim 2, compared to 15.24% and 14.23% in Figure 4.
In this part 1 of chapter 6 (exercise 6) we will analyse the RATS data, taking inspiration from Chapter 8 of MABS.
Load libraries
# Access the packages dplyr and tidyr
library(dplyr)
library(tidyr)
library(ggplot2)
Read the data. We will read the data prepared in the data wrangling part.
RATS_long <- read.table(file = "https://raw.githubusercontent.com/oqe/IODS-project/master/data/RATS_long.tsv", sep = "\t", header = TRUE)
Check the data and convert the variables ID and Group to factors.
head(RATS_long)
## ID Group WD Weight Time
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
str(RATS_long)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
RATS_long$ID <- factor(RATS_long$ID)
RATS_long$Group <- factor(RATS_long$Group)
glimpse(RATS_long)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,~
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, ~
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", ~
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470~
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, ~
summary(RATS_long)
## ID Group WD Weight Time
## 1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00
## 2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00
## 3 : 11 3:44 Mode :character Median :344.5 Median :36.00
## 4 : 11 Mean :384.5 Mean :33.55
## 5 : 11 3rd Qu.:511.2 3rd Qu.:50.00
## 6 : 11 Max. :628.0 Max. :64.00
## (Other):110
Let's view our data graphically by plotting it in different ways.
# Draw the plot
ggplot(RATS_long, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
#scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATS_long$Weight), max(RATS_long$Weight)))
Figure 1. Line plot of individual rats’ weights by group
Here we can already see possible outlier individuals in each group. In group 1 there seems to be one rat with lower weights over time than the other rats in the group. In group 2 there seems to be a rat with considerably higher weights than the others.
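To pin down which rats these are, a sketch that picks out each group's lightest and heaviest individual by mean weight (re-reading the same long-format data; assumes the URL is reachable):

```r
# Find each group's lightest and heaviest rat by mean weight
library(dplyr)
RATS_long <- read.table(
  "https://raw.githubusercontent.com/oqe/IODS-project/master/data/RATS_long.tsv",
  sep = "\t", header = TRUE)

RATS_long %>%
  group_by(Group, ID) %>%
  summarise(mean_weight = mean(Weight), .groups = "drop") %>%
  group_by(Group) %>%
  filter(mean_weight == min(mean_weight) | mean_weight == max(mean_weight))
```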
As in Chapter 8 of MABS with the BPRS data, let's see what we get when we standardize weight…
# Standardise the variable Weight
RATS_long <- RATS_long %>%
group_by(Time) %>%
mutate(stdWeight = (Weight-mean(Weight))/sd(Weight)) %>%
ungroup()
# Glimpse the data
glimpse(RATS_long)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,~
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, ~
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1~
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, ~
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, ~
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, ~
Here we plot the standardized weight…
# Plot again with the standardised Weight
ggplot(RATS_long, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized Weight")
Figure 2. Line plot of individual rats’ standardized weights by group
Perhaps standardizing weight is not the best approach here, as weight increases over time and is a single, directly measured variable, unlike the BPRS dataset's bprs variable, which is an aggregate score…
> The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe).
Let's continue with a few more plots of the data. For the next plot we'll summarize the mean and standard error by group.
n <- RATS_long$Time %>% unique() %>% length()
# Summary data with mean and standard error of Weight by Group and Time
RATS_long_S <- RATS_long %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Glimpse the data
glimpse(RATS_long_S)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, ~
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2~
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.30~
# Plot the mean profiles
ggplot(RATS_long_S, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.9,0.4)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
Figure 3. Mean weights with standard error of rat feeding groups
Time is in days.
Weight is in grams.
Here we can see that group 1 has very little variation in the individual rats’ weights (small standard error bars).
Groups 2 and 3 have clearly higher weights than group 1.
Group 2 rats show the largest between-rat variation in weight (largest standard error bars).
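Note that the summary above divides by `n`, the number of distinct time points; arguably the standard error of a group mean at each time point should use the number of rats in that group instead. A sketch of that alternative using `n()` inside `summarise()`, on a tiny hypothetical data frame standing in for RATS_long:

```r
library(dplyr)

# Toy stand-in for RATS_long (hypothetical values)
toy <- data.frame(
  Group  = factor(rep(c(1, 2), each = 4)),
  Time   = rep(c(1, 8), times = 4),
  Weight = c(240, 245, 250, 255, 410, 415, 430, 440)
)

# se uses the per-cell observation count n(), not a fixed number of time points
toy_summary <- toy %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = sd(Weight) / sqrt(n()), .groups = "drop")

toy_summary
```

Using `.groups = "drop"` also silences the "`summarise()` has grouped output" message seen above.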
ggplot(RATS_long, aes(x = factor(Time), y = Weight, fill = Group)) +
geom_boxplot() +
stat_summary(fun = mean, color = "darkred", position = position_dodge(0.75),
geom = "point", shape = 18, size = 3)
Figure 4. Boxplot of rat groups by weight
As in Figure 1, where we theorized about possible outlier individuals, in Figure 4 we can see that the outlier values likely come from those same rats. Group means for each measurement day are drawn as red diamonds; outliers are black dots.
In group 1 we have low-value outliers on all measurement days except day 1. In group 2 we have outliers on all measurement days except the first, and the outlier values are very high compared to the mean. In group 3 the outliers tend to be low values.
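Rather than eyeballing the cutoffs, outliers could be flagged programmatically with the 1.5 × IQR rule that `geom_boxplot()` itself uses. A small sketch on hypothetical weights (not the course data):

```r
# Flag outliers with the 1.5 * IQR rule used by geom_boxplot()
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

# Hypothetical weights: 600 is far above the rest
w <- c(250, 255, 260, 262, 258, 600)
flag_outliers(w)
```

Only the 600-gram value falls outside the fences and is flagged.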
Here we draw another boxplot, now aggregating over the whole time period:
RATS_long_o <- RATS_long %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Draw a boxplot of the mean versus treatment
ggplot(RATS_long_o, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), Days 1-64")
Figure 5. Boxplots of rat weights by group
Let’s remove the outliers from each group… it seems a bit of a dodgy thing to do, but let’s try it.
# Create new data by filtering out the outliers, and adjust the ggplot code to draw the plot again with the new data
RATS_long_o1 <- RATS_long_o %>%
filter(!(Group == "1" & mean < 250)) %>%
filter(!(Group == "2" & mean > 550)) %>%
filter(!(Group == "3" & mean < 500))
RATS_long_o1
## # A tibble: 13 x 3
## Group ID mean
## <fct> <fct> <dbl>
## 1 1 1 261.
## 2 1 3 260.
## 3 1 4 267.
## 4 1 5 269.
## 5 1 6 275.
## 6 1 7 275.
## 7 1 8 265.
## 8 2 9 441.
## 9 2 10 453.
## 10 2 11 455.
## 11 3 14 536.
## 12 3 15 540.
## 13 3 16 534.
Plot data with outliers removed:
ggplot(RATS_long_o1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), Days 1-64")
Figure 6. Boxplots of rat weights by group with outlier removed
Looks too good to be true, but let’s head on anyways.
# Fit the linear model with the mean as the response
fit <- lm(mean ~ Group, data = RATS_long_o1)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 175958 87979 2577.4 2.721e-14 ***
## Residuals 10 341 34
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Fit the linear model with the mean as the response
fit <- lm(Weight ~ Group, data = RATS_long)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: Weight
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 2604050 1302025 998.17 < 2.2e-16 ***
## Residuals 173 225662 1304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on both ANOVA tables, the groups differ significantly from each other, with or without the outliers.
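The ANOVA only tells us that at least one group mean differs; to see which pairs of groups differ, Tukey’s HSD on an `aov` fit is one option. A sketch on made-up weights (not the course data), with group means chosen to resemble the three feeding groups:

```r
# Pairwise group comparisons after a significant ANOVA, on made-up data
set.seed(1)
toy <- data.frame(
  Group  = factor(rep(1:3, each = 8)),
  Weight = c(rnorm(8, 260, 5), rnorm(8, 450, 5), rnorm(8, 530, 5))
)

fit_aov <- aov(Weight ~ Group, data = toy)
TukeyHSD(fit_aov)  # adjusted p-values for each pair of groups
```

With group means this far apart, all three pairwise comparisons come out highly significant.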
In this part 2 of chapter 6 (exercise 6) we will analyze the BPRS data, taking inspiration from Chapter 9 of MABS.
Read data.
BPRS_long <- read.table(file = "https://raw.githubusercontent.com/oqe/IODS-project/master/data/BPRS_long.tsv", sep = "\t", header = TRUE)
Check the data; the treatment and subject variables will be converted to factors.
head(BPRS_long)
## treatment subject weeks bprs week
## 1 1 1 week0 42 0
## 2 1 2 week0 58 0
## 3 1 3 week0 54 0
## 4 1 4 week0 55 0
## 5 1 5 week0 72 0
## 6 1 6 week0 48 0
str(BPRS_long)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
BPRS_long$subject <- factor(BPRS_long$subject)
BPRS_long$treatment <- factor(BPRS_long$treatment)
glimpse(BPRS_long)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0~
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, ~
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
summary(BPRS_long)
## treatment subject weeks bprs week
## 1:180 1 : 18 Length:360 Min. :18.00 Min. :0
## 2:180 2 : 18 Class :character 1st Qu.:27.00 1st Qu.:2
## 3 : 18 Mode :character Median :35.00 Median :4
## 4 : 18 Mean :37.66 Mean :4
## 5 : 18 3rd Qu.:43.00 3rd Qu.:6
## 6 : 18 Max. :95.00 Max. :8
## (Other):252
Let’s try to create Figure 9.1 of Chapter 9 of MABS using the BPRS data.
ggplot(BPRS_long, aes(x = week, y = bprs, color = treatment)) +
geom_point() +
geom_text(aes(label=treatment),size = 4, hjust=0, vjust=0)
Figure 1. Point plot of bprs scores, coloring by treatment group
Pretty difficult to catch anything meaningful from this plot.
ggplot(BPRS_long, aes(x = week, y = bprs, color = subject)) +
geom_line() +
scale_x_continuous(name = "Week") + # breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "BPRS") +
theme(legend.position = "top") +
facet_grid(. ~ treatment, labeller = label_both)
This one is also pretty difficult to interpret.
Let’s plot bprs by week for each subject, with both treatments in the same panel…
ggplot(BPRS_long, aes(x = week, y = bprs, color = treatment)) +
geom_line() +
scale_x_continuous(name = "Week") + # breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "BPRS") +
theme(legend.position = "top") +
facet_wrap(. ~ subject, ncol =3, labeller = label_both)
Figure 2. Line plot of bprs scores, coloring by subject
This is a lot nicer. Here we can see each subject’s bprs score by week under both treatments.
Continuing to ignore the repeated-measures structure of the data, we will fit a multiple linear regression model with bprs as response and week and treatment as explanatory variables.
# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, BPRS_long)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS_long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Our linear model is not very good: it explains only about 18% of the variance (adjusted R-squared ≈ 0.18), and the treatment effect is far from significant.
The previous model assumes independence of the repeated bprs measurements, and this assumption is highly unlikely to hold.
Let’s try out different more appropriate models and graphics.
We’ll first fit a random intercept model with the same two explanatory variables, week and treatment. This allows the linear regression fit for each subject to differ in intercept from the other subjects.
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS_long, REML = FALSE)
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRS_long
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
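The random-effects variances printed above quantify how strongly measurements within a subject are correlated: the intraclass correlation (ICC) is the subject variance divided by the total variance. A quick check from the printed values:

```r
# Intraclass correlation from the random-effects variances printed above
var_subject  <- 47.41   # subject (Intercept) variance
var_residual <- 104.21  # residual variance
icc <- var_subject / (var_subject + var_residual)
round(icc, 2)
#> 0.31
```

So roughly 31% of the total variation in bprs is between-subject, confirming that the independence assumption of the plain linear model was unreasonable.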
# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS_long, REML = FALSE)
# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRS_long
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRS_long
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
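The Chisq statistic in the `anova()` table is a likelihood ratio test: twice the difference in log-likelihoods, i.e. the difference in deviances, compared against a chi-squared distribution. It can be reproduced (up to rounding of the printed deviances) from the numbers in the table:

```r
# Likelihood ratio test reproduced from the deviances printed above
chisq <- 2738.7 - 2731.4  # deviance(BPRS_ref) - deviance(BPRS_ref1)

# df = 2: the random slope adds one variance and one covariance parameter
p_value <- pchisq(chisq, df = 2, lower.tail = FALSE)
round(p_value, 3)
#> 0.026
```

The random intercept and slope model fits significantly better than the random intercept model, so we keep the random slopes.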
Fit a random intercept and slope model that allows for a treatment × Week interaction.
# create a random intercept and random slope model with the interaction
BPRS_ref2 <- lmer(bprs ~ week + treatment + week * treatment + (week | subject), data = BPRS_long, REML = FALSE)
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + week * treatment + (week | subject)
## Data: BPRS_long
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRS_long
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The comparison of this interaction model with the previous one is shown in the ANOVA table above: the interaction improves the fit only marginally (p ≈ 0.075), so its benefit is questionable.
# draw the plot of BPRS_long with the observed bprs values
ggplot(BPRS_long, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (Weeks)") +
scale_y_continuous(name = "BPRS") +
theme(legend.position = "top") +
facet_grid(. ~ treatment, labeller = label_both) +
ggtitle("Observed")
Figure 3. Observed bprs scores profile
# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref2)
# Create a new column fitted to BPRS_long
BPRS_long <- mutate(BPRS_long, fitted = Fitted)
# draw the plot of BPRS_long with the fitted values of bprs
ggplot(BPRS_long, aes(x = week, y = fitted, group = subject)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)") +
scale_y_continuous(name = "Fitted BPRS") +
theme(legend.position = "top") +
facet_grid(. ~ treatment, labeller = label_both) +
ggtitle("Fitted")
Figure 4. Fitted bprs scores profile from interaction model
The fitted bprs scores for each subject from the interaction model are shown in Figure 4, with the observed values in Figure 3 above. The observed values are quite a mess, but there is definitely a downward trend over time.
(more chapters to be added similarly as we proceed with the course!)